# Lab 26 - k-Nearest Neighbors classifier 2

We will continue using the Titanic training and test data from [Kaggle](https://www.kaggle.com/c/titanic) that we used in Labs 24 and 25.

First import the necessary libraries.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
%matplotlib inline

### Loading and cleaning the training data

The code for loading in the data and cleaning it from Lab 25 is below.

In [None]:
train = pd.read_csv("../Data/train.csv")
train.head()

In [None]:
# fill in missing age data
train["Age"] = train["Age"].fillna(train["Age"].median())

# fill in the missing embarked data
train["Embarked"] = train["Embarked"].fillna("S")

In [None]:
# create dummy variables for passenger class, sex, and embarked
train2 = pd.get_dummies(train, columns = ["Pclass","Sex","Embarked"], drop_first = True)
train2.head()

In [None]:
# remove the remaining qualitative columns
train2.drop("Cabin",axis = 1,inplace = True)
train2.drop("Name",axis = 1,inplace = True)
train2.drop("Ticket",axis = 1,inplace = True)

# we should also drop PassengerId, although we did not last lab
train2.drop("PassengerId",axis = 1, inplace = True)

In [None]:
# split the original training data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(train2.drop("Survived", axis=1), train2["Survived"], test_size=0.2)

In [None]:
# create a 3-nearest neighbor classifier
knn = KNeighborsClassifier(n_neighbors=3)
# fit the classifier to the training data
knn.fit(X_train, y_train)
# test and score the classifier on our test data (part of the original training data)
knn.score(X_test, y_test)
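To make clear what `score` reports, here is a minimal sketch on synthetic data (the arrays `X_toy` and `y_toy` are made up for illustration and are not the Titanic data). It assumes you want a repeatable split, so it passes `random_state` to `train_test_split`, and it compares `score` with accuracy computed by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned features and labels (assumption, not real data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# random_state fixes the shuffle, so the same rows land in the test set on every run
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

knn_toy = KNeighborsClassifier(n_neighbors=3)
knn_toy.fit(X_tr, y_tr)

# score() returns accuracy: the fraction of test rows predicted correctly
accuracy = knn_toy.score(X_te, y_te)
manual_accuracy = np.mean(knn_toy.predict(X_te) == y_te)
print(accuracy, manual_accuracy)
```

Without `random_state`, the split changes each time the cell runs, so the score in the cell above will vary a little from run to run.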

Now we are going to try running our classifier on the Kaggle test data and use all of our training data to fit the classifier.

### Loading and cleaning the Kaggle test data

First, load the test data from Kaggle into the dataframe `test`.

We have to process the test data in the same way as the training data, namely filling in the missing age and embarked data, creating the dummy variables for Pclass, Sex, and Embarked, and dropping the Cabin, Name, and Ticket columns.  Do this below, adding as many extra cells as you need.

Call the processed dataframe `test2`.
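Once both dataframes are processed, the test data should end up with the same columns, in the same order, as the training data without `Survived`. A minimal sketch of that check is below. The frames `train_toy` and `test_toy` are tiny made-up examples, not the Kaggle files, and only one dummy column is created to keep it short.

```python
import pandas as pd

# Toy frames standing in for the raw training and test data (hypothetical values)
train_toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Survived": [0, 1, 1],
})
test_toy = pd.DataFrame({
    "Age": [34.0, 47.0],
    "Sex": ["male", "female"],
})

# Apply the same dummy-variable step to both frames
train_toy2 = pd.get_dummies(train_toy, columns=["Sex"], drop_first=True)
test_toy2 = pd.get_dummies(test_toy, columns=["Sex"], drop_first=True)

# The test columns should match the training columns once Survived is removed
expected_columns = list(train_toy2.drop("Survived", axis=1).columns)
columns_match = list(test_toy2.columns) == expected_columns
print(expected_columns)
print(columns_match)
```

If the check prints `False` on your own data, compare the two column lists to see which cleaning step was skipped or applied differently.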

Store the column `PassengerId` in a variable.  We need this information for submitting our predictions to Kaggle, but don't want to use it in making the predictions.

Remove (drop) the `PassengerId` column from the test data.

<details> <summary>Pattern:</summary>
<code>df = df.drop("name_of_column_to_drop",axis=1)</code>
</details>
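The order of the two steps matters: the column has to be saved before it is dropped. A minimal sketch on a made-up dataframe (`toy` and `toy_ids` are hypothetical names, and the values are invented):

```python
import pandas as pd

# Small made-up frame standing in for the processed test data
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Age": [30.0, 45.0, 28.0],
    "Fare": [8.05, 13.0, 7.75],
})

# Save the IDs first, then remove the column from the features
toy_ids = toy["PassengerId"]
toy = toy.drop("PassengerId", axis=1)

print(list(toy.columns))
print(list(toy_ids))
```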

### Training the 3-nearest neighbor classifier

Next we split up our training data into the answer (the `Survived` column) and the input data (all other columns).

First, store the `Survived` column in the variable `y_train_kaggle`.

Next, remove (drop) the `Survived` column from the training data, and store the new data frame in the variable `X_train_kaggle`.

<details> <summary>Pattern:</summary>
<code>X_train_kaggle = df.drop("name_of_column_to_drop",axis=1)</code>
</details>

Create a new k-nearest neighbors object with k = 3, and fit it on the entire training data (`X_train_kaggle` and `y_train_kaggle`).

In [None]:
knn_kaggle = KNeighborsClassifier(n_neighbors=3)
knn_kaggle.fit(X_train_kaggle,y_train_kaggle)

<details> <summary>Pattern:</summary>
<code>knn_classifier_var_name = KNeighborsClassifier(n_neighbors= k)
knn_classifier_var_name.fit(training_data,correct_output)
</code>
</details>

### Making predictions with our 3-nearest neighbor classifier

Now use this fitted classifier to make predictions on the test data (`test2`), and store them in the variable `y_pred`.

In [None]:
y_pred = knn_kaggle.predict(test2)

The error message says something about NaN.  Could there be missing data (NaN) in a different column of the test data?  Use the `describe()` method to see if this is the case.
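As a sketch of how to read the output, the example below uses a made-up dataframe with one missing value. In `describe()`, the `count` row counts only non-missing entries, so a column whose count is lower than the others has gaps. Calling `isna().sum()` is an alternative that counts the missing values directly.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing Fare value (not the Kaggle data)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, np.nan, 7.925, 53.1],
})

# The count row of describe() ignores NaN, so Fare shows one fewer than Age
counts = toy.describe().loc["count"]
print(counts)

# isna().sum() gives the number of missing values in each column
missing = toy.isna().sum()
print(missing)
```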

The fare column is missing one value.  Fill it in with the median fare.

<details> <summary>Pattern:</summary>
Assuming the variable `median_fare` contains the median fare:
<code>df["Fare"] = df["Fare"].fillna(median_fare)
</code>
</details>
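A minimal sketch of the whole step on a made-up dataframe, including computing the median (pandas skips NaN when it computes `median()`). The names `toy` and `toy_median_fare` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing Fare value (not the Kaggle data)
toy = pd.DataFrame({"Fare": [7.25, np.nan, 7.925, 53.1]})

# median() ignores the NaN entry
toy_median_fare = toy["Fare"].median()

# Replace the missing value with the median
toy["Fare"] = toy["Fare"].fillna(toy_median_fare)

print(toy_median_fare)
print(toy["Fare"].isna().sum())
```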

Now try making the prediction again.

Finally, we want to check our predictions by uploading them to Kaggle.  Kaggle wants the predictions in a CSV file with two columns:  `PassengerId` and `Survived` (our predictions).

First we will create a new dataframe containing these two columns.  The code below assumes you stored the passenger ID column in the variable `passengerId`, but you can change this to whatever variable name you used.

In [None]:
df = pd.DataFrame(data = {"PassengerId":passengerId, "Survived":y_pred})

Next create a new CSV containing the data in this dataframe:

In [None]:
df.to_csv("test1.csv", index=False)

Can you see the CSV file in your directory?  Download it and try uploading it to Kaggle (you will have to create an account) to check your predictions.  [Submission site](https://www.kaggle.com/c/titanic/submit)
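Before uploading, it can help to read the file back in and confirm that it has exactly the two expected columns and one row per test passenger. The sketch below does this with a made-up submission written to a temporary folder. The names `toy_submission`, `toy_path`, and `check` are hypothetical, and for your own file you would read `"test1.csv"` instead.

```python
import os
import tempfile
import pandas as pd

# Made-up IDs and predictions standing in for passengerId and y_pred
toy_submission = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 0]})

# Write to a temporary folder so this sketch does not overwrite your real file
toy_path = os.path.join(tempfile.mkdtemp(), "toy_submission.csv")
toy_submission.to_csv(toy_path, index=False)

# Read the file back and inspect its columns and size
check = pd.read_csv(toy_path)
print(list(check.columns))
print(check.shape)
```

With `index=False`, pandas leaves out the row index, so the file contains only the `PassengerId` and `Survived` columns.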